Removing Noise Content from Online News Articles
Authors
Abstract
A typical news web page contains a news article. Along with the tags that hold the article content, it also contains tags for navigation links, privacy and copyright information, and advertisements. These are called noise tags. Given an online news article in HTML form, existing works extract the article by discovering informative tags using various heuristic techniques. In this paper, we follow an alternative approach that removes noise tags from the web page. In particular, we define two types of noise content to be removed from a web page: static noise content and dynamic noise content. Static noise content is content that is present in all article web pages of the same news website, whereas dynamic noise content consists of advertisements and irrelevant hyperlinks that change from one article web page to another. We present a two-stage approach for identifying static and dynamic noise content. After identifying the tags with noise content, we remove them from the news web page, so that only the tags with article content remain. Experiments are conducted on news web pages extracted from 11 different news websites. Compared with three well-known open-source content extraction techniques, our approach shows around a 6% improvement in recall, with fair F1-score and precision, relative to the best-performing content extraction technique.
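The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how either stage is realized, so the sketch assumes that static noise can be approximated by text blocks that recur verbatim on every sampled page of a site, and that dynamic noise can be approximated by a link-density threshold. The names `BlockExtractor`, `extract_blocks`, `static_noise`, `remove_noise`, and the `max_link_density` parameter are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}


class BlockExtractor(HTMLParser):
    """Collects (text, link_chars) pairs, one per run of text between block tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._buf, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._buf.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def static_noise(pages):
    """Stage 1 (assumed): text blocks present on every page of the same site."""
    sets = [{text for text, _ in extract_blocks(page)} for page in pages]
    return set.intersection(*sets) if sets else set()


def remove_noise(html, static, max_link_density=0.5):
    """Stage 2 (assumed): drop static blocks, then blocks that are mostly link text."""
    kept = []
    for text, link_chars in extract_blocks(html):
        if text in static:
            continue  # static noise: repeated across the site
        if link_chars / len(text) > max_link_density:
            continue  # dynamic noise: advertisement or irrelevant hyperlink
        kept.append(text)
    return kept


if __name__ == "__main__":
    page_a = ("<div>Home | World | Sports</div><p>Storm hits the coast overnight.</p>"
              "<div><a href='/x'>Buy cheap watches now</a></div>"
              "<p>Copyright Example News</p>")
    page_b = ("<div>Home | World | Sports</div><p>Council approves new budget.</p>"
              "<div><a href='/y'>Win a free holiday</a></div>"
              "<p>Copyright Example News</p>")
    static = static_noise([page_a, page_b])
    print(remove_noise(page_a, static))
```

Removing noise rather than selecting content tends to favor recall, which is consistent with the recall improvement the abstract reports: a block is discarded only when one of the two noise tests fires, so borderline article text is kept by default.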
Similar resources
Automatically Identifying Good Conversations Online (Yes, They Do Exist!)
Online news platforms curate high-quality content for their readers and, in many cases, users can post comments in response. While comment threads routinely contain unproductive banter, insults, or users “shouting” over each other, there are often good discussions buried among the noise. In this paper, we define a new task of identifying “good” conversations, which we call ERICs—Engaging, Respe...
Different News for Different Views: Political News-Sharing Communities on Social Media Through the UK General Election in 2015
Media exposure is a central concept in understanding the dynamics of public opinion and political change. Traditional models of media exposure have been severely challenged by the shift to online news consumption and news-sharing on social media. Here we use network analysis and automated content analysis to examine the interaction between news media and social media around the UK General Elect...
Thematic Progression in the Rhetorical Sections of an Online Iraqi English Newspaper
Thematic development refers to the way theme and rheme in the clause are developed. The theory of rhetorical structure can be defined as the strategies that follow specific ways to make writing more persuasive. The present study aimed to examine how Iraqi writers maintain cohesion in the text by analyzing the patterns of thematic progression in various rhetorical sections in an online ...
Buzzer - Online Real-Time Topical News Article and Source Recommender
The significant growth of media and user-generated content online has allowed for the widespread adoption of recommender systems due to their proven ability to reduce the workload of a user and personalise content. In this paper, we describe our prototype system called Buzzer, which harnesses real-time micro-blogging activity, such as Twitter, as the basis for promoting personalised content, su...